This notebook demonstrates an Environment‐Wide Association Study (EnWAS) work flow shown in Figure 1.1.
Figure 1.1: EnWAS Work Flow
In this example, we are interested in finding nutrition predictors associated with diastolic blood pressure. To show work with the above steps, we will perform a linear regression model and a linear regression with applying natural spline functions on the continuous variables. In short, a spline-based linear regression model can capture the non-linearity trend of the data by using piece-wise polynomial functions on continuous variables. More information about spline can be found on wikipedia
We will use the NHANES as our data source more details about data can be found on the CDC website. A quick start notebook about accessing the data in our docker database with provided R function shown in here.
In this demonstration, we choose diastolic blood pressure as the outcome interest. These data are contained in tables in the NHANES database.
For each survey a number of people were chosen to provide blood pressure measurements and these were obtained in duplicate on two separate days. To get an more accurate measurement of the diastolic blood pressure, we average two times of measurements for each participants.
Based on our literature review we feel that sex, age, ethnicity, education and BMI are likely to affect blood pressure, and we choose them as the confounders to adjust for. In a more realistic analysis we would likely also want to find out who is taking medications that are intended to alter blood pressure, but for simplicity of the exposition we will ignore that issue in this analysis.
Normally the model fitting and QA/QC process would be carried out in an exploratory and iterative fashion, however that is not easy to capture in a static document. Instead we will compare two basic models, one where we assume that there is a straight line relationship between continuous confounders and our response, and a second approach where we use spline models. This approach has some similarities to the Generalized Additive Model (GAMs) and much of the literature there is useful. There are some differences though, and we will outline them as we go.
We load the NHANES data using functions from PHONTO package. The PHONTO package provides a set function to facilitate researchers to get access NHANES data in our docker database. More detail about data query and search tools can be found on quick start page.
In addition, the phesant() function allows the user to check the data inspired by the PHESANT package.
cols = list(DEMO_C=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
DEMO_D=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
DEMO_E=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"),
BPX_C=c('BPXDI1','BPXDI2'),
BPX_D=c('BPXDI1','BPXDI2'),
BPX_E=c('BPXDI1','BPXDI2'),
BMX_C="BMXBMI",
BMX_D="BMXBMI",
BMX_E="BMXBMI"
)
base_df <- jointQuery(cols)
base_df |> DT::datatable(options=list(pageLength=5))
#> Warning in instance$preRenderHook(instance): It seems your data is too big
#> for client-side DataTables. You may consider server-side processing: https://
#> rstudio.github.io/DT/server.html
Transform education and BMI according CDC
# remove age under 20 and diastolic with 0s
base_df = base_df |> subset(RIDAGEYR>20 & BPXDI1!=0 & BPXDI2!=0 )
# assign the gender and ethnicity to the real levels
# Transform education to:
# < High School
# - High School
# > High School
trans_DEDUC = ifelse(stringr::str_detect(base_df$DMDEDUC2,"9"), "<HS",
ifelse(stringr::str_detect(base_df$DMDEDUC2,"High School"), "HS",
ifelse(stringr::str_detect(base_df$DMDEDUC2,'College'), ">HS", NA)))
base_df$DMDEDUC2 <- as.factor(trans_DEDUC)
# transform the BMI to:
# underweight range.
# healthy weight range.
# overweight range.
# obesity range.
trans_BMI = ifelse(base_df$BMXBMI < 18.5, "underweight",
ifelse(base_df$BMXBMI < 25, "healthy",
ifelse(base_df$BMXBMI < 30, "overweight",
ifelse(base_df$BMXBMI >= 30, "obesity",
NA))))
base_df$BMXBMI = as.factor(trans_BMI)
base_df$years = as.factor(paste0(base_df$Begin.Year,"-",base_df$EndYear))
base_df = na.omit(base_df)
# Take average first and second read for the diastolic blood pressure.
base_df$DIASTOLIC <- (base_df$BPXDI1+base_df$BPXDI2)/2
To find plausible models, we should build models and perform QA/QC process iteratively. In the following demonstrations, we built a linear regression model and another linear model with apply natural spline function on the continuous variables. The outcome is diastolic is the average of the diastolic first (BPXDI1) and second (BPXDI2) reads, gender(RIAGENDR), age (RIDAGEYR), ethnicity (RIDRETH1), BMI(BMXBMI) and education(DMDEDUC2).
ns_str <-
'DIASTOLIC ~ ns(RIDAGEYR, knots = seq(30, 80, by = 10), Boundary.knots=c(20,90)) * RIAGENDR + BMXBMI + RIDRETH1+DMDEDUC2+years'
ns_base <- lm(formula = as.formula(ns_str), base_df)
We need to check the base model and ensure it runs correctly before performing EnWAS. However, the classical methods such as Q-Q plots, residual plots, and goodness of fit (GoF) tests are generally ill-suited. We want to provide the following plots to demonstrate the criteria of QA/QC.
It is clear to see that the diastolic blood pressure has a parabola-like shape relation with the age, as shown in Figure 4.1. The yellow and blue lines are generated by smooth prediction from linear and spline models. The dots are randomly sampled in 20% of the data points.
Figure 4.1: Linear vs Spline
The transnational methods cannot detect the linear model fit data poorly.
We can check the residuals against the fitted value with a smooth scatter plot. And we find that there are no apparent trends for both models, even though the spline model has fewer mean square errors.
A possible solution to check the “goodness of fit (GoF)” is to check whether apparent trends in the plots of residual against terms in the models. We can spot a slight trend residual in the BMI range from 20 to 40, indicating that using linear regression on BMI term may not hurt the model performance too much. However, a strong parabola-like trend can be observed in the residuals of the linear model with respect to ages, which indicates that the linear model cannot capture age. In other words, the model is not good enough to be a base model to run EnWAS; the findings are more likely false positives if using such a base model. On the other hand, the residuals spline model has no clear trends with respect to both BMI and age, which means the base model captures the relations of outcomes (diastolic) and the known confounders.
We can further check the base models with binned plots, which can be helpful when the data set is large. The binned plot is a way that “zoom in” to look at the trends.
Figure 4.2 shows QA/QC plots.
Figure 4.2: Quality Assurance and Quality Control
Identify a set of phenotypes and exposures that we want to test for association with the response variable of interest.
In this example, we choose phenotype from the dietary data, we identified the phenotypes and saved them into a built-in variable call exposure_vars in EnWAS package. “The objective of the dietary interview component is to obtain detailed dietary intake information from NHANES participants. The dietary intake data are used to estimate the types and amounts of foods and beverages (including all types of water) consumed during the 24-hour period prior to the interview (midnight to midnight), and to estimate intakes of energy, nutrients, and other food components from those foods and beverages.” More detail can be found code book.
cols = list("DR1TOT_C"=exposure_vars,"DR1TOT_D"=exposure_vars,"DR1TOT_E"=exposure_vars)
diet_data = unionQuery(cols)
dim(diet_data)
#> [1] 29355 52
Once we load the dietary data, we should also check it with the PHESANT function and preprocess it if necessary. We may need to eplore the missing data for the above data according to the notebook. Currently, we only keep the columns that has less missing rate less than 15%. The observations with missing values in the covariates in the baseline model or current phenotype will be remove during EnWAS.
In NHANES, the phenotypes are often right-skewed with a very long tail on the right side of the distribution, which can be addressed with logarithm transformation followed by a z-transformation. However, it would take a tremendous effort to manually and exhaustively inspect the distributions and figure out appropriate transformation methods for all kinds of phenotypes with different types of distribution when dealing with extensive data sets. Therefore, we recommend using inverse normal transformation (INT) for all continuous variables because INT can apply various distributions. We compared the EnWAS results of logarithm transformation followed by a z-transformation with normal inverse transformation.
Figure 5.1 shows an example of Total fat (DR1TTFAT) with non transformation, log-z transformation, and inverse normal transformation.
Figure 5.1: Transformation for Total fat (gm)
In this step, we need to carry out a set of regression models, one for each exposure/phenotype in Step 5, and report on the analysis.
data = merge(base_df,diet_data,by = "SEQN")
data = na.omit(data)
xwas = enwas(ns_str,exposure_vars,data)
#> Warning in is.data.frame(x): NAs introduced by coercion
#> Warning in is.data.frame(x): NAs introduced by coercion
#> Warning in is.data.frame(x): NAs introduced by coercion
xwas_log <- enwas(ns_str,exposure_vars,data,trans = "log")
xwas_inv <- enwas(ns_str,exposure_vars,data,trans = "inv")
Figure 6.1 shows the estimates and CI of the exposure variables and only displays the top 10 (if they have more than 30 satisfy the filters) ranked by absolute values of the estimates. The variables with their CI containing zeros are also removed.
forest_plot(xwas$enwas_res,10) # filter out CI contains 0
Figure 6.1: Forest Plot for non tranformation EnWAS
Figure 6.2 shows the top 10 exposures, ranked by the differences in the estimates for the same variables.
- ns denotes the variables non-transformed, but the estimates with beta^hat * SD(X)
- ns_inv denotes variables transformed inverse normal transformation
- ns-log denotes variables transformed with log followed by z-transformation
Figure 6.2: Multiple Forest Plots
The following scatter plot shows the inverse normal transformation estimates against estimates (beta^hat * SD(X)) of nontransformed variables. The top 20 has added text for the variables, but it is pretty clear to show the information.
Figure 6.3 estimates of EnWAS non-transformed (EnWAS) with inverse normal transformation (EnWAS INT). The the top 20 most difference variables label with variable names.
Figure 6.3: Scatter Plot for EnWAS against EnWAS with INT
We can also compare the non-transformed with log-z transformation, and log-z transformation with inverse normal transformation.